Dense video captioning based on local attention
Authors
Abstract
Dense video captioning aims to locate multiple events in an untrimmed video and generate a caption for each event. Previous methods have had difficulty establishing multimodal feature relationships between frames and captions, resulting in low accuracy of the generated captions. To address this problem, a novel dense video captioning model based on local attention (DVCL) is proposed. DVCL employs a 2D temporal-differential CNN to extract video features, which are then encoded by a deformable transformer that establishes global dependencies over the input sequence. DIoU and TIoU are incorporated into the event proposal matching algorithm and its evaluation during training, yielding more accurate proposals and hence higher-quality captions. Furthermore, an LSTM based on local attention is designed, enabling each generated word to attend to the relevant frames. Extensive experimental results demonstrate the effectiveness of DVCL. On the ActivityNet Captions dataset, it performs significantly better than other baselines, with improvements of 5.6%, 8.2%, and 15.8% over the best baseline in BLEU4, METEOR, and CIDEr, respectively.
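To make the proposal-matching idea concrete, the following is a minimal sketch of temporal IoU (TIoU) and a DIoU-style variant for 1D event proposals. It adapts the center-distance penalty of 2D DIoU to time intervals; the interval representation `(start, end)` and the exact penalty term are assumptions for illustration and may differ from the paper's formulation.

```python
def tiou(a, b):
    """Temporal IoU between two intervals a=(s1, e1), b=(s2, e2)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))  # overlap length
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def tdiou(a, b):
    """TIoU minus a normalized center-distance penalty (DIoU-style).

    The penalty pushes matched proposals toward ground-truth centers
    even when intervals do not overlap.
    """
    ca, cb = (a[0] + a[1]) / 2.0, (b[0] + b[1]) / 2.0
    enclose = max(a[1], b[1]) - min(a[0], b[0])  # smallest enclosing span
    if enclose <= 0:
        return 0.0
    return tiou(a, b) - ((ca - cb) ** 2) / (enclose ** 2)
```

For example, `tiou((0, 10), (5, 15))` is 1/3, while `tdiou` lowers that score by the squared, normalized distance between interval centers, so proposals that overlap equally but are better centered score higher.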
Similar Resources
Video Captioning with Multi-Faceted Attention
Recently, video captioning has been attracting an increasing amount of interest, due to its potential for improving accessibility and information retrieval. While existing methods rely on different kinds of visual features and model structures, they do not fully exploit relevant semantic information. We present an extensible approach to jointly leverage several sorts of visual features and sema...
Bidirectional Attentive Fusion with Context Gating for Dense Video Captioning
Dense video captioning is a newly emerging task that aims at both localizing and describing all events in a video. We identify and tackle two challenges on this task, namely, (1) how to utilize both past and future contexts for accurate event proposal predictions, and (2) how to construct informative input to the decoder for generating natural event descriptions. First, previous works predomina...
End-to-End Dense Video Captioning with Masked Transformer
Dense video captioning aims to generate text descriptions for all events in an untrimmed video. This involves both detecting and describing events. Therefore, all previous methods on dense video captioning tackle this problem by building two models, i.e. an event proposal and a captioning model, for these two sub-problems. The models are either trained separately or in alternation. This prevent...
Improving video captioning for deaf and hearing-impaired people based on eye movement and attention overload
Deaf and hearing-impaired people capture information in video through visual content and captions. Those activities require different visual attention strategies and up to now, little is known on how caption readers balance these two visual attention demands. Understanding these strategies could suggest more efficient ways of producing captions. Eye tracking and attention overload detections ar...
Spatio-Temporal Attention Models for Grounded Video Captioning
Automatic video captioning is challenging due to the complex interactions in dynamic real scenes. A comprehensive system would ultimately localize and track the objects, actions and interactions present in a video and generate a description that relies on temporal localization in order to ground the visual concepts. However, most existing automatic video captioning systems map from raw video da...
Journal
Journal title: IET Image Processing
Year: 2023
ISSN: 1751-9659, 1751-9667
DOI: https://doi.org/10.1049/ipr2.12819